Hayden Hoopes
The Spam Ham Detection dataset is a collection of text messages that have been labeled as either spam or ham. The dataset consists of a CSV file called spam.csv, which contains 5,572 text messages and their labels.
In this project, I will use the TextVectorization layer in Keras to prepare text data for binary classification on the Spam Ham Detection dataset. I will define and train a Keras model to classify text messages as either spam or ham.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('spam.csv', encoding='latin-1')
# Clean unnecessary columns and rename
df = df[['v1', 'v2']]
df = df.rename(columns={'v1': 'spam', 'v2': 'text'})
df['spam'] = df['spam'].map({'ham': 0, 'spam': 1})
labels = df['spam']
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(df['text'], labels, test_size=0.3, random_state=1)
The baseline accuracy for this model is about 86.6%, which is the proportion of the majority class (ham) shown below. A classifier must therefore exceed 86.6% accuracy to be considered effective.
df['spam'].value_counts(normalize=True)
0    0.865937
1    0.134063
Name: spam, dtype: float64
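As a quick check on the baseline, the sketch below recomputes the majority-class accuracy by hand. The class counts (4,825 ham and 747 spam) are not printed in this notebook; they are an assumption derived from the 5,572 total messages and the proportions shown above.

```python
# Class counts are an assumption derived from the printed proportions
# and the stated total of 5,572 messages; they are not read from spam.csv.
class_counts = {"ham": 4825, "spam": 747}

total_messages = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs.
baseline_accuracy = class_counts[majority_label] / total_messages

print(majority_label, round(baseline_accuracy, 6))
```

Any model that cannot beat this always-predict-ham rule has learned nothing useful from the text.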
Below, a baseline model that uses a bag of words approach to identify the presence of words in the data set is constructed and evaluated to create a target for sequence modeling, which will be applied later. This bag of words model uses a max vocabulary of 10,000 (10,000 features) and a single dense hidden layer with 32 nodes. The hidden layer has 10,000 × 32 weights plus 32 bias terms, or 320,032 parameters. The output layer with 1 node adds 32 weights and 1 bias term (33 parameters), for a total of 320,065 parameters.
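The parameter count can be reproduced with a few lines of arithmetic. This is a minimal sketch; `dense_params` is a helper name introduced here for illustration, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per
    input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_layer = dense_params(10_000, 32)  # 10,000 features into 32 nodes
output_layer = dense_params(32, 1)       # 32 hidden nodes into 1 output node
bow_total = hidden_layer + output_layer  # the Dropout layer adds no parameters

print(hidden_layer, output_layer, bow_total)
```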
According to the Loss by Epoch graphic below, the model begins overfitting after only about three epochs: the validation loss reaches its minimum of 0.0900 at epoch 3 and then climbs fairly steadily to 0.2921 by epoch 50, while the training loss keeps falling. This could be because the model does not take the ordering of the terms in a body of text into consideration, and the relative sparsity of vocabulary in any message may make it easy to overtrain the model to find patterns that don't really exist.
That said, the best model trained had an accuracy of about 98% on the validation set. That's definitely better than the 86.6% baseline and is a very good result, although because the classes are imbalanced, accuracy alone does not show how many of the spam messages are actually caught.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
# Fit the tokenizer on the training texts only
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
from tensorflow.keras.layers import TextVectorization
# This vectorizer uses both unigrams and bigrams
text_vectorization = TextVectorization(max_tokens=10_000, ngrams=(1,2), output_mode='multi_hot', pad_to_max_tokens=False)
text_vectorization.adapt(X_train)
X_train_vectorized = text_vectorization(X_train)
X_val_vectorized = text_vectorization(X_val)
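To make the multi-hot encoding above more concrete, here is a toy pure-Python sketch of the same idea: lowercase the text, strip punctuation, collect unigrams and bigrams, and mark which vocabulary terms are present. It only approximates what TextVectorization does. The helper names, the tiny example texts, and the choice to reserve index 0 for unknown terms are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def ngram_terms(tokens, sizes=(1, 2)):
    """Return all n-grams of the requested sizes as space-joined strings."""
    terms = []
    for n in sizes:
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms

def build_vocab(texts, max_tokens):
    """Keep the most frequent terms; index 0 is reserved for unknown terms."""
    counts = Counter(term for text in texts for term in ngram_terms(tokenize(text)))
    kept = [term for term, _ in counts.most_common(max_tokens - 1)]
    return {term: index for index, term in enumerate(["[UNK]"] + kept)}

def multi_hot(text, vocab):
    """1 where a vocabulary term occurs in the text, 0 elsewhere."""
    vector = [0] * len(vocab)
    for term in ngram_terms(tokenize(text)):
        vector[vocab.get(term, 0)] = 1
    return vector

toy_texts = ["Free prize now!", "See you now"]  # hypothetical messages
toy_vocab = build_vocab(toy_texts, max_tokens=20)
toy_vector = multi_hot("free prize", toy_vocab)
print(len(toy_vocab), sum(toy_vector))
```

Note that the vector records only whether a term is present, not how often or where it occurs, which is why this representation ignores word order beyond the bigram level.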
from tensorflow import keras
from tensorflow.keras import layers
def get_model(max_tokens=10_000, hidden_dim=32):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation='relu')(inputs)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model
model = get_model()
model.summary()
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_2 (InputLayer)        [(None, 10000)]           0
 dense_2 (Dense)             (None, 32)                320032
 dropout_1 (Dropout)         (None, 32)                0
 dense_3 (Dense)             (None, 1)                 33
=================================================================
Total params: 320,065
Trainable params: 320,065
Non-trainable params: 0
_________________________________________________________________
callbacks = [keras.callbacks.ModelCheckpoint('bag_of_words.keras', save_best_only=True)]
history = model.fit(X_train_vectorized, y_train, validation_data=(X_val_vectorized, y_val), epochs=50, callbacks=callbacks)
Epoch 1/50 122/122 [==============================] - 1s 8ms/step - loss: 0.3299 - accuracy: 0.9236 - val_loss: 0.1541 - val_accuracy: 0.9737
Epoch 2/50 122/122 [==============================] - 1s 8ms/step - loss: 0.0995 - accuracy: 0.9800 - val_loss: 0.0937 - val_accuracy: 0.9797
Epoch 3/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0607 - accuracy: 0.9867 - val_loss: 0.0900 - val_accuracy: 0.9803
Epoch 4/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0456 - accuracy: 0.9892 - val_loss: 0.0915 - val_accuracy: 0.9791
Epoch 5/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0355 - accuracy: 0.9910 - val_loss: 0.1007 - val_accuracy: 0.9785
Epoch 6/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0345 - accuracy: 0.9926 - val_loss: 0.1028 - val_accuracy: 0.9785
Epoch 7/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0301 - accuracy: 0.9923 - val_loss: 0.1088 - val_accuracy: 0.9791
Epoch 8/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0283 - accuracy: 0.9933 - val_loss: 0.1157 - val_accuracy: 0.9797
Epoch 9/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0272 - accuracy: 0.9946 - val_loss: 0.1211 - val_accuracy: 0.9797
Epoch 10/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0219 - accuracy: 0.9951 - val_loss: 0.1251 - val_accuracy: 0.9791
Epoch 11/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0201 - accuracy: 0.9954 - val_loss: 0.1300 - val_accuracy: 0.9791
Epoch 12/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0167 - accuracy: 0.9962 - val_loss: 0.1367 - val_accuracy: 0.9791
Epoch 13/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0183 - accuracy: 0.9962 - val_loss: 0.1399 - val_accuracy: 0.9797
Epoch 14/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0150 - accuracy: 0.9964 - val_loss: 0.1431 - val_accuracy: 0.9797
Epoch 15/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0159 - accuracy: 0.9969 - val_loss: 0.1484 - val_accuracy: 0.9791
Epoch 16/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0146 - accuracy: 0.9969 - val_loss: 0.1512 - val_accuracy: 0.9791
Epoch 17/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0118 - accuracy: 0.9974 - val_loss: 0.1620 - val_accuracy: 0.9785
Epoch 18/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0118 - accuracy: 0.9974 - val_loss: 0.1700 - val_accuracy: 0.9797
Epoch 19/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0117 - accuracy: 0.9972 - val_loss: 0.1722 - val_accuracy: 0.9785
Epoch 20/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0102 - accuracy: 0.9977 - val_loss: 0.1730 - val_accuracy: 0.9785
Epoch 21/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0105 - accuracy: 0.9974 - val_loss: 0.1800 - val_accuracy: 0.9785
Epoch 22/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0092 - accuracy: 0.9979 - val_loss: 0.1802 - val_accuracy: 0.9785
Epoch 23/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0083 - accuracy: 0.9979 - val_loss: 0.1827 - val_accuracy: 0.9785
Epoch 24/50 122/122 [==============================] - 1s 8ms/step - loss: 0.0076 - accuracy: 0.9982 - val_loss: 0.1898 - val_accuracy: 0.9785
Epoch 25/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0065 - accuracy: 0.9987 - val_loss: 0.1949 - val_accuracy: 0.9785
Epoch 26/50 122/122 [==============================] - 1s 8ms/step - loss: 0.0056 - accuracy: 0.9987 - val_loss: 0.2033 - val_accuracy: 0.9797
Epoch 27/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0051 - accuracy: 0.9982 - val_loss: 0.2072 - val_accuracy: 0.9791
Epoch 28/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0051 - accuracy: 0.9990 - val_loss: 0.2145 - val_accuracy: 0.9791
Epoch 29/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0047 - accuracy: 0.9987 - val_loss: 0.2166 - val_accuracy: 0.9791
Epoch 30/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0059 - accuracy: 0.9985 - val_loss: 0.2233 - val_accuracy: 0.9791
Epoch 31/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0053 - accuracy: 0.9979 - val_loss: 0.2215 - val_accuracy: 0.9797
Epoch 32/50 122/122 [==============================] - 1s 6ms/step - loss: 0.0029 - accuracy: 0.9987 - val_loss: 0.2283 - val_accuracy: 0.9791
Epoch 33/50 122/122 [==============================] - 1s 6ms/step - loss: 0.0049 - accuracy: 0.9987 - val_loss: 0.2241 - val_accuracy: 0.9797
Epoch 34/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0030 - accuracy: 0.9992 - val_loss: 0.2370 - val_accuracy: 0.9779
Epoch 35/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0029 - accuracy: 0.9992 - val_loss: 0.2365 - val_accuracy: 0.9797
Epoch 36/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0027 - accuracy: 0.9995 - val_loss: 0.2439 - val_accuracy: 0.9791
Epoch 37/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0040 - accuracy: 0.9992 - val_loss: 0.2418 - val_accuracy: 0.9797
Epoch 38/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0036 - accuracy: 0.9990 - val_loss: 0.2466 - val_accuracy: 0.9791
Epoch 39/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.2593 - val_accuracy: 0.9779
Epoch 40/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0024 - accuracy: 0.9995 - val_loss: 0.2601 - val_accuracy: 0.9785
Epoch 41/50 122/122 [==============================] - 1s 4ms/step - loss: 0.0026 - accuracy: 0.9992 - val_loss: 0.2711 - val_accuracy: 0.9773
Epoch 42/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.2767 - val_accuracy: 0.9779
Epoch 43/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0024 - accuracy: 0.9995 - val_loss: 0.2718 - val_accuracy: 0.9785
Epoch 44/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0017 - accuracy: 0.9997 - val_loss: 0.2678 - val_accuracy: 0.9797
Epoch 45/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0021 - accuracy: 0.9995 - val_loss: 0.2746 - val_accuracy: 0.9797
Epoch 46/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0020 - accuracy: 0.9995 - val_loss: 0.2804 - val_accuracy: 0.9785
Epoch 47/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0018 - accuracy: 0.9995 - val_loss: 0.2843 - val_accuracy: 0.9791
Epoch 48/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0019 - accuracy: 0.9995 - val_loss: 0.2789 - val_accuracy: 0.9785
Epoch 49/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0034 - accuracy: 0.9992 - val_loss: 0.2835 - val_accuracy: 0.9785
Epoch 50/50 122/122 [==============================] - 1s 7ms/step - loss: 0.0015 - accuracy: 0.9997 - val_loss: 0.2921 - val_accuracy: 0.9785
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
test_model = keras.models.load_model('bag_of_words.keras')
test_model.evaluate(X_val_vectorized, y_val)
53/53 [==============================] - 0s 1ms/step - loss: 0.0900 - accuracy: 0.9803
[0.09002583473920822, 0.9802631735801697]
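The loss of 0.0900 reported by the reloaded model matches epoch 3 of the training log, which is what one would expect given that ModelCheckpoint with save_best_only=True monitors val_loss by default. The sketch below picks out that epoch by hand, using only the first six val_loss values copied from the log. It illustrates the selection rule and is not the callback's actual implementation.

```python
# Validation loss for epochs 1-6, copied from the training log above.
val_loss_by_epoch = [0.1541, 0.0937, 0.0900, 0.0915, 0.1007, 0.1028]

# The best checkpoint is the epoch with the lowest validation loss.
best_index = min(range(len(val_loss_by_epoch)), key=val_loss_by_epoch.__getitem__)
best_epoch = best_index + 1  # epochs are numbered from 1

print(best_epoch, val_loss_by_epoch[best_index])
```

Every later epoch has a higher validation loss, so the remaining 47 epochs of training never replace the saved checkpoint.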
Next, sequence modeling is used to determine whether the accuracy of detecting spam messages can be improved by taking into account the ordering of the words in a message rather than just the presence of individual words. This model also uses a max vocabulary of 10,000, and every message is padded or truncated to a sequence of 1,000 tokens, so at most the first 1,000 words of any given body of text are considered by the model. The total number of parameters is 1,284,673.
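The total of 1,284,673 parameters can be broken down by layer with the usual formulas for embedding, simple recurrent, and dense layers. This is a sketch of the arithmetic only; the helper names are introduced here for illustration and are not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    """Input weights, recurrent weights, and one bias per unit."""
    return input_dim * units + units * units + units

def dense_layer_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding = embedding_params(10_000, 128)
# Bidirectional wraps two independent SimpleRNN(16) layers,
# one reading forward and one reading backward.
bidirectional = 2 * simple_rnn_params(128, 16)
# The two 16-unit outputs are concatenated into 32 features.
output = dense_layer_params(32, 1)
sequence_total = embedding + bidirectional + output

print(embedding, bidirectional, output, sequence_total)
```

Almost all of the parameters sit in the embedding table; the recurrent layer itself accounts for only 4,640 of them.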
The Loss by Epoch graph again shows that this model begins overfitting early. The validation loss decreases from 0.1079 to 0.0725 between epochs 1 and 2 and then increases through epoch 5, while the training loss keeps falling. This is surprising to me, since I expected the embedding layer to add dimensionality to the model that would help it recognize patterns among related words in the text. I also expected the model to need more training because of the bidirectional RNN layer, which I assumed would need more time to adapt to the particular words of this data set and backpropagate the updated weights through the model.
This model had a validation accuracy of about 98.0% (0.9797), which is only marginally lower than the previous model (0.9803). A gap this small corresponds to roughly one message in the validation set and could easily be due to random chance. Since the difference between the validation accuracy of the two models is negligible, it is likely safe to say that either model would perform well in a real-world situation for classifying messages as spam or not.
text_vectorization = layers.TextVectorization(max_tokens=10_000, output_sequence_length=1_000, output_mode='int')
text_vectorization.adapt(X_train)
X_train_vectorized = text_vectorization(X_train)
X_val_vectorized = text_vectorization(X_val)
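With output_mode='int', each message becomes a fixed-length list of vocabulary indices rather than a presence vector. The toy sketch below shows the idea in plain Python, assuming the convention that index 0 is padding and index 1 stands for out-of-vocabulary tokens. The three-word vocabulary and the function name are hypothetical, and the real layer also strips punctuation, which is omitted here.

```python
def to_int_sequence(text, vocab, length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length.
    Index 0 is padding and index 1 marks out-of-vocabulary tokens."""
    ids = [vocab.get(token, 1) for token in text.lower().split()]
    ids = ids[:length]                      # truncate long messages
    return ids + [0] * (length - len(ids))  # pad short messages with zeros

# Hypothetical vocabulary; real indices come from adapt() on the training text.
tiny_vocab = {"free": 2, "prize": 3, "now": 4}

padded = to_int_sequence("Free prize today", tiny_vocab, length=6)
truncated = to_int_sequence("free prize now free prize now free", tiny_vocab, length=4)
print(padded)
print(truncated)
```

Because word positions are preserved, a recurrent layer can read the indices in order, and the zero padding is what mask_zero=True in the Embedding layer tells later layers to skip.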
inputs = keras.Input(shape=(None,), dtype='int64')
embedded = layers.Embedding(input_dim=10_000, output_dim=128, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.SimpleRNN(16))(embedded)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
Model: "model_2"
_________________________________________________________________
 Layer (type)                   Output Shape           Param #
=================================================================
 input_4 (InputLayer)           [(None, None)]         0
 embedding_1 (Embedding)        (None, None, 128)      1280000
 bidirectional (Bidirectional)  (None, 32)             4640
 dropout_2 (Dropout)            (None, 32)             0
 dense_4 (Dense)                (None, 1)              33
=================================================================
Total params: 1,284,673
Trainable params: 1,284,673
Non-trainable params: 0
_________________________________________________________________
callbacks = [keras.callbacks.ModelCheckpoint('sequence_modeling.keras', save_best_only=True)]
history = model.fit(X_train_vectorized, y_train, validation_data=(X_val_vectorized, y_val), epochs=5, callbacks=callbacks)
Epoch 1/5 122/122 [==============================] - 37s 289ms/step - loss: 0.3195 - accuracy: 0.8872 - val_loss: 0.1079 - val_accuracy: 0.9713
Epoch 2/5 122/122 [==============================] - 36s 297ms/step - loss: 0.0650 - accuracy: 0.9856 - val_loss: 0.0725 - val_accuracy: 0.9797
Epoch 3/5 122/122 [==============================] - 35s 290ms/step - loss: 0.0286 - accuracy: 0.9936 - val_loss: 0.0752 - val_accuracy: 0.9761
Epoch 4/5 122/122 [==============================] - 35s 287ms/step - loss: 0.0144 - accuracy: 0.9954 - val_loss: 0.0828 - val_accuracy: 0.9791
Epoch 5/5 122/122 [==============================] - 35s 291ms/step - loss: 0.0067 - accuracy: 0.9982 - val_loss: 0.0990 - val_accuracy: 0.9707
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
test_model = keras.models.load_model('sequence_modeling.keras')
test_model.evaluate(X_val_vectorized, y_val)
53/53 [==============================] - 2s 41ms/step - loss: 0.0725 - accuracy: 0.9797
[0.07249666750431061, 0.9796651005744934]
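In this section, I would like to test the bag of words model using some made-up messages to see if it can differentiate between spam and non-spam messages.
The model seems to have a decent idea of how to rank the messages: the three highest scores (about 0.25 to 0.45) belong to the messages that mention clicking a link or claiming benefits, while the other messages score below 0.06. None of the scores reaches 0.5, however, so at that threshold every message would be labeled as not spam. It seems like the word "claim" and the words "click" and "link" may be important for identifying spam messages.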
model = keras.models.load_model('bag_of_words.keras')
messages = [
'Hey bro how r u doin?',
'I just won a contest by clicking this link',
'Your package could not be delivered, click this link',
'Have you heard the news about social security insurance',
'Claim your benefits today by clicking',
'I am running late but will be there soon!'
]
text_vectorization = TextVectorization(max_tokens=10_000, ngrams=(1,2), output_mode='multi_hot', pad_to_max_tokens=False)
text_vectorization.adapt(X_train)
input_text = text_vectorization(messages)
output = model.predict(input_text)
print(output)
1/1 [==============================] - 0s 30ms/step
[[0.01035442]
 [0.26425034]
 [0.24674352]
 [0.05468433]
 [0.45137173]
 [0.00176566]]
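The sigmoid outputs above are scores between 0 and 1, so turning them into labels requires a cutoff. The sketch below applies two cutoffs to the printed scores: the common 0.5 and a lower 0.2. The 0.2 value is an arbitrary choice for illustration and has not been tuned on any data.

```python
# Scores copied from the prediction output above, in message order.
scores = [0.01035442, 0.26425034, 0.24674352, 0.05468433, 0.45137173, 0.00176566]

def classify(scores, threshold=0.5):
    """Label a message as spam when its score reaches the threshold."""
    return ["spam" if score >= threshold else "ham" for score in scores]

default_labels = classify(scores)                 # cutoff of 0.5
lenient_labels = classify(scores, threshold=0.2)  # illustrative lower cutoff

print(default_labels)
print(lenient_labels)
```

At 0.5 every made-up message is labeled ham, while the lower cutoff flags the three link and claim messages. A lower cutoff would also flag more legitimate messages, so it would need to be checked against the validation set before being relied on.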
This project offered useful insight into the applications of different types of neural networks for NLP classification tasks. While I expected to see the sequence modeling approach perform better than the bag of words model, the differences in performance are very small and both models performed well.
In my opinion, one of the reasons that the sequence modeling approach did not outperform bag of words vectorization is that text messages tend to be short, with a large variety of vocabulary, so ordering may not be as important as it is in formal writing. The use of abbreviations, slang, and combinations of these features with regular words likely makes it difficult for a small RNN like the one used in this file to spot clear relationships between the ordering of certain words and spam messages.
In the future, this model could perhaps be improved by increasing the vocabulary size of the vectorizer and by increasing the number of layers in either of the models. This may allow future models to better identify the patterns in text messages that distinguish spam from non-spam.